Challenges in Developing a Rule based Urdu Stemmer

Authors

  • Sajjad Ahmad Khan
  • Waqas Anwar
  • Usama Ijaz Bajwa
Abstract

The Urdu language poses several challenges for Natural Language Processing (NLP), largely due to its rich morphology. Morphological processing is therefore particularly important for Information Retrieval (IR) in Urdu. The core tool of IR is a stemmer, which reduces a word to its stem form. Because of the diverse nature of Urdu, developing a stemmer is a challenging task: a single word can have a large number of variant forms (derivational and inflectional). This paper presents the issues that arise in developing a rule-based Urdu stemmer.


Similar resources

A Light Weight Stemmer for Urdu Language: A Scarce Resourced Language

Stemming is a procedure that conflates morphologically related terms into a single term without performing a complete morphological analysis. The Urdu language poses several challenges for Natural Language Processing (NLP), largely due to its rich morphology. The core tool of information retrieval (IR) is a stemmer, which reduces a word to its stem form. Due to the diverse nature of Urdu, developing its ste...


Rule Based Urdu Stemmer

This paper presents a rule-based Urdu stemmer. In this technique, rules are applied to remove suffixes and prefixes from inflected words. Urdu is widely spoken all over the world, but little work has been done on Urdu stemming. A stemmer helps find the root of an inflected word. Various possibilities of inflected words like وں (vao+noon-gunna), ے (badi-ye), یاں (choti-ye+alif+noon-gunna) ...
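The suffix-removal idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the rule set of any of the papers listed here: the three suffixes are the ones named in the snippet, while the function name `strip_suffix`, the longest-first ordering, and the `MIN_STEM_LEN` safeguard are hypothetical choices made for the example.

```python
# Illustrative suffix rules only. The three suffixes are the ones named in the
# snippet above, ordered longest first (an assumption of this sketch).
SUFFIXES = ["یاں", "وں", "ے"]

# Assumed safeguard: never strip a suffix if fewer than this many characters
# would remain. The value 2 is arbitrary and chosen for illustration.
MIN_STEM_LEN = 2


def strip_suffix(word: str) -> str:
    """Remove the first matching suffix, keeping at least MIN_STEM_LEN characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    # No rule applied: return the word unchanged.
    return word


if __name__ == "__main__":
    # Build an inflected form from a base word plus one of the listed suffixes.
    inflected = "کتاب" + SUFFIXES[1]
    print(inflected, "->", strip_suffix(inflected))
```

Blind stripping of this kind also shows why the task is hard: a word such as لڑکیاں would lose یاں and be reduced to لڑک rather than لڑکی, which is the sort of case that exception lists or more specific rules are meant to handle.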


A Rule based Stemming Method for Multilingual Urdu Text

Urdu is a national language of Pakistan, and more than 200 million people use it for spoken and written communication. A large amount of unstructured Urdu text exists in the world, and useful information can be extracted from it by applying data mining techniques. However, it seriously lacks processing capabilities to develop innovative systems based on Urdu language. In this paper, auth...


Assas-band, an Affix-Exception-List Based Urdu Stemmer

Both inflectional and derivational morphology lead to multiple surface forms of a word. Stemming reduces these forms back to their stem or root, and is a very useful tool for many applications. There has not been any work reported on Urdu stemming. The current work develops an Urdu stemmer, Assas-Band, and improves the performance using more precise affix based exception lists, instead of the co...


Template based affix stemmer for a morphologically rich language

Word stemming is one of the most significant factors affecting the performance of Natural Language Processing (NLP) applications such as Information Retrieval (IR) systems, part-of-speech tagging, machine translation systems and syntactic parsing. The Urdu language poses several challenges for NLP, largely due to its rich morphology. In Urdu, the stemming process is different as compared to th...




Publication date: 2011